ABSTRACT

This paper describes the bitonic sort algorithm in detail and implements
it on the CUDA architecture. We also apply two optimizations to the
implementation that exploit the characteristics of the GPU and greatly
improve its efficiency. Finally, we measure the speedup of the optimized
bitonic sort on the GPU over quick sort on the CPU. Quick sort is not
well suited to parallel implementation, but on the CPU it is, to some
extent, more efficient than other sorting algorithms. Hence, to assess the
speedup and performance, we compare bitonic sort on the GPU with quick
sort on the CPU.

For arrays of random 32-bit integers, the experimental results show that
our implementation achieves a speedup of nearly 20 times. When the array
size is about 2^18, the speedup ratio even reaches 30.
1 INTRODUCTION

Sorting is a well-studied topic and a fundamental problem in computer science.
Sorting appears as an internal step of many programs and processes. According
to some statistics, more than 25 percent of CPU time is spent on sorting.
Hence, efficient sorting algorithms and implementations are one of the basic
keys to performance improvement. Currently, there are many sorting algorithms,
such as bubble sort, odd-even sort, insertion sort, heap sort, selection sort,
radix sort, quick sort and bitonic sort.

There is an increasing need for programs to exploit parallelism across a
variety of architectural approaches. Because of physical limitations of
technology and materials, the clock frequency of single-core CPUs is
constrained. Today's graphics cards contain very powerful multi-core
processors. These processors are specialized for compute-intensive, highly
parallel computations. They can be used to assist the CPU in solving problems
that can be efficiently data-parallelized. Meanwhile, more and more
scientists, researchers and software developers are using GPUs to accelerate
their algorithms and applications.


GPUs have clear advantages over CPUs in processing power and memory
bandwidth, without a correspondingly large increase in cost or power
consumption. Because graphics rendering is highly parallel, GPU processing
power can be raised by adding parallel processing units and memory control
units and by increasing memory bandwidth. GPU designers devote more
transistors to execution units, whereas a CPU devotes many of them to complex
control logic and caches in order to improve the efficiency of a small number
of execution units. In a CPU, integer calculation, branching, logic and
floating-point operations are performed by different units, in addition to a
floating-point accelerator. Thus, a CPU performs differently on compute tasks
of different types. A GPU, by contrast, handles tasks of different types with
the same integer and floating-point arithmetic units, so its integer
performance is similar to its floating-point performance. At present,
mainstream GPUs have adopted a unified architecture. In addition, GPU
computing has a further major advantage: its memory subsystem. The very high
memory bandwidth not only keeps the large floating-point capability supplied
at a stable throughput, but also ensures that data-intensive tasks run
efficiently. This is why GPUs are increasingly popular in video games, the
film industry, industrial design, medical imaging, space exploration,
telecommunications, etc.


2 CUDA

2.1 CUDA Program Model 

CUDA is a parallel programming model and software environment for NVIDIA's
GPUs. CUDA allows the programmer to write kernels that are executed on the
GPU. When a kernel is executed, a number of CUDA threads are created and run
in parallel. Threads are organized in warps and blocks. A warp is a
fixed-size group of threads (32 threads on GPUs with compute capability 1.x
and 2.0), while a block consists of up to 16 (1.x) or 32 (2.0) warps. The
number of warps per block and the number of blocks (and therefore the total
number of threads) can be specified for each individual kernel launch. To run
a kernel, the input data is transferred from CPU main memory to GPU memory,
and the results are sent back to the CPU after the computation.

2.2 CUDA Memory Model 

A thread that executes a kernel has access to high-performance thread-local
registers whose lifetime is that of the thread. Each thread block has a
shared memory (visible to all threads of the block) whose lifetime is that of
the block. Shared memory is as fast as registers, as long as there are no
conflicting accesses by threads of the same block. All threads have random
access to global memory, which has the lifetime of the application, i.e. data
reside in global memory across several kernel launches. In general, access to
global memory is significantly slower than access to registers or shared
memory. When threads of the same half-warp (upper or lower 16 threads of a
warp) access the global memory simultaneously, the accesses can be coalesced
into a single memory transaction if the accessed memory lies in the same
global memory segment. If not, the simultaneous access causes multiple
sequential memory transfers.
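
As a rough illustration of the coalescing rule, the following Python sketch
counts how many distinct memory segments the 16 threads of a half-warp touch.
It is a simplified model only: the 64-byte segment size, the 4-byte word size
and the function name count_transactions are assumptions made for
illustration, and real segment sizes depend on the device and access width.

```python
def count_transactions(indices, word_bytes=4, segment_bytes=64):
    """Count distinct segments touched by one half-warp (simplified model).

    indices: element index accessed by each of the 16 threads.
    word_bytes and segment_bytes are illustrative assumptions.
    """
    segments = {(i * word_bytes) // segment_bytes for i in indices}
    return len(segments)


half_warp = range(16)
# Consecutive 4-byte words: all 16 accesses fall into one segment.
contiguous = [t for t in half_warp]
# Stride of 16 words: every thread touches a different segment.
strided = [16 * t for t in half_warp]

print(count_transactions(contiguous))  # 1
print(count_transactions(strided))     # 16
```

Under this model, neighbouring threads reading neighbouring elements need one
transaction, while a large stride needs one transaction per thread.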

CUDA provides a barrier mechanism for synchronizing threads of the same
block, but does not provide any synchronization mechanism for threads of
different blocks. In general, the host is not blocked when launching a kernel,
but it can explicitly wait for a kernel to finish (host synchronization).
Since variables reside in global memory across multiple kernel launches,
tasks can be synchronized by distributing the workload over multiple kernel
launches using host synchronization, while keeping information in global
memory.

Efficient synchronization and memory access are two of the most important
factors that influence the performance of a GPGPU application. Since kernel
launches cause a small startup delay, the application should use CUDA's
barrier mechanism instead of host synchronization or, even better, avoid
explicit synchronization altogether. The host can only access the global
memory of the GPU, so global memory access cannot be completely avoided, but
the global memory accesses of a kernel should be reduced to a minimum.



Figure 1: CUDA memory model

3 BITONIC SORT ALGORITHM 

3.1 Algorithm Description 

Bitonic sort is a binary merge sort that parallelizes well on parallel
processors. For example, 1,5,9,10,12,8,7,2 is a bitonic sequence: the first
half of the sequence is monotonically increasing and the second half is
monotonically decreasing. Similarly, 12,8,7,2,1,5,9,10 is also a bitonic
sequence. Let A be an arbitrary input sequence to sort and let $n = 2^k$ be
the length of A. The process of sorting A then consists of k phases. The
subsequences of length 2

(A[0],A[1]),(A[2],A[3]),...,(A[n-2],A[n-1])

are bitonic sequences by definition. In the first phase these subsequences are
sorted using bitonic merge (as shown above) alternating descending and
ascending, which makes the subsequences of length 4 bitonic sequences:

(A[0],A[1],A[2],A[3]),...,(A[n-4],A[n-3],A[n-2],A[n-1])

In the second phase these subsequences of length 4 are sorted alternating
descending and ascending, resulting in subsequences of length 8 being bitonic
sequences. In the r-th phase of bitonic sort the total number of subsequences
being sorted is $2^{k-r}$ and the length of each of these subsequences is
$2^r$. Sorting a sequence of length $2^r$ using bitonic merge consists of r
steps. After the (k-1)-th phase, sequence A is a bitonic sequence. A is sorted
in the last phase k. Figure 2 shows the bitonic sorting network for input
sequences of size n = 8. The sorting network consists of 3 = log 8 phases,
phase p having p steps. Every step consists of 4 = n/2 compare/exchange
operations.
Figure 2: Bitonic sorting network
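
The phase and step structure of the network can be sketched sequentially as
follows. This is a minimal Python illustration and not the paper's GPU code;
the function name bitonic_sort and the XOR-based partner indexing are a common
formulation assumed here, and the final order is chosen to be ascending.

```python
def bitonic_sort(a):
    """Sort a list whose length is a power of two, in place, ascending.

    Returns the number of steps executed.
    """
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    steps = 0
    size = 2                      # phase r merges subsequences of length 2^r
    while size <= n:
        stride = size // 2        # phase r has r steps: strides 2^(r-1) .. 1
        while stride >= 1:
            for i in range(n):
                partner = i ^ stride
                if partner > i:   # each pair is handled once: n/2 per step
                    ascending = (i & size) == 0   # direction alternates
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            steps += 1
            stride //= 2
        size *= 2
    return steps


data = [10, 2, 7, 5, 12, 1, 9, 8]
steps = bitonic_sort(data)
print(data, steps)  # [1, 2, 5, 7, 8, 9, 10, 12] 6
```

For n = 8 the sketch runs 1 + 2 + 3 = 6 steps, matching the three phases of
the network described in the text.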


3.2 Algorithm Analysis 

To our knowledge, quick sort is often used for sorting on the CPU because of
its high efficiency. Although the average time complexity of quick sort is
$O(n \log n)$, it may not be suitable for implementation on the GPU.

Based on the description in Section 3.1, the time complexity of bitonic sort
is $O(n (\log n)^2)$. It is nevertheless very suitable for implementation on
the GPU, for the following reasons: the procedure is fixed and stable, it is
data independent, and there are no data dependencies within a round. It needs
$\sum_{i=1}^{\log n} i = \frac{\log n (\log n + 1)}{2}$ rounds. At the same
time, it completes
$\sum_{i=1}^{\log n} i \cdot \frac{n}{2} = \frac{n \log n (\log n + 1)}{4}$
compare-exchange operations.
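
The two counts can be checked numerically. The short Python sketch below
(the function name bitonic_counts is introduced here for illustration) sums
the steps over all phases and compares the result with the closed forms
log n (log n + 1) / 2 and n log n (log n + 1) / 4, with logarithms to base 2.

```python
def bitonic_counts(n):
    """Return (rounds, compare_exchanges) for a bitonic network on n = 2^k inputs."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    k = n.bit_length() - 1                     # k = log2(n)
    rounds = sum(p for p in range(1, k + 1))   # phase p has p steps
    compare_exchanges = rounds * (n // 2)      # n/2 operations per step
    return rounds, compare_exchanges


for n in (8, 1024, 2 ** 18):
    k = n.bit_length() - 1
    rounds, ops = bitonic_counts(n)
    assert rounds == k * (k + 1) // 2
    assert ops == n * k * (k + 1) // 4
    print(n, rounds, ops)
```

For n = 8 this gives 6 rounds and 24 compare-exchange operations.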

3.3 Implementation 

Suppose an array has length $N = 2^k$; sorting it needs $k(k+1)/2$ rounds.
Each round calls a kernel function, which completes $N/2$ compare-exchange
operations. All these operations can be carried out by many threads created
in parallel; when they finish, the data is synchronized, and then the next
kernel is called.
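
The round-by-round scheme can be mimicked on the CPU to make the indexing
concrete. In the Python sketch below, each call to compare_exchange_kernel
stands for the work of one GPU thread and each pass of the inner host loop
stands for one kernel launch. The function names and the mapping from thread
id to element pair are assumptions for illustration, not the paper's actual
kernel.

```python
def compare_exchange_kernel(data, tid, stride, size):
    """Work of one simulated thread: a single compare-exchange.

    tid ranges over 0 .. N/2 - 1; stride and size identify the step.
    """
    # Map the thread id to the lower index of its pair.
    i = 2 * stride * (tid // stride) + (tid % stride)
    j = i + stride
    ascending = (i & size) == 0
    if (data[i] > data[j]) == ascending:
        data[i], data[j] = data[j], data[i]


def sort_with_one_launch_per_round(data):
    """Host loop: one simulated kernel launch per round; returns the launch count."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    launches = 0
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            # On a GPU these N/2 threads would run in parallel.
            for tid in range(n // 2):
                compare_exchange_kernel(data, tid, stride, size)
            launches += 1   # returning to the host acts as the synchronization
            stride //= 2
        size *= 2
    return launches


values = [(i * 37 + 11) % 101 for i in range(16)]
print(sort_with_one_launch_per_round(values))  # 10 launches for N = 2^4
print(values == sorted(values))                # True
```

Within one round no two threads touch the same element, which is why the
N/2 compare-exchange operations can run independently.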

4 OPTIMIZATIONS 

The method described above has two unavoidable shortcomings: too many kernels
are launched, and accessing global memory takes too much time, i.e. incurs a
long delay. We therefore optimize these two aspects, reducing the number of
global memory accesses and the number of kernel launches.

4.1 Optimization 1: using shared memory

In our implementation, every step needs $N/2$ threads, and each thread
completes one compare-exchange operation. If a subsequence of length $2^s$
processed in step s fits completely into the shared memory of one block, the
block first transfers the subsequence into shared memory. That way we can
process the steps s, s-1, ..., 1 within one kernel function, using shared
memory and the efficient block-synchronization mechanism between steps instead
of host synchronization. The time spent accessing global memory is thus
reduced.
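
The effect on the number of kernel launches can be sketched with a CPU
simulation. In the Python code below, a slice copied out of the array stands
for the shared memory of one block, and the parameter tile (the number of
elements assumed to fit into shared memory) and the function name are
hypothetical. It is a sketch of the idea under these assumptions, not the
paper's kernel.

```python
def sort_with_shared_memory_tiles(data, tile):
    """Bitonic sort where steps that fit in a tile are fused into one launch.

    tile: number of elements assumed to fit in one block's shared memory
    (a power of two). Returns the number of simulated kernel launches.
    """
    n = len(data)
    assert n & (n - 1) == 0 and tile & (tile - 1) == 0 and 2 <= tile <= n
    launches = 0
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            if 2 * stride <= tile:
                # Fused launch: each block copies its tile to "shared memory",
                # runs all remaining steps of this phase, and writes it back.
                for base in range(0, n, tile):
                    shared = data[base:base + tile]
                    s = stride
                    while s >= 1:
                        for t in range(tile):
                            p = t ^ s
                            if p > t:
                                ascending = ((base + t) & size) == 0
                                if (shared[t] > shared[p]) == ascending:
                                    shared[t], shared[p] = shared[p], shared[t]
                        s //= 2   # a block-level barrier would go here on a GPU
                    data[base:base + tile] = shared
                launches += 1
                break
            # Step too wide for a tile: one launch working on global memory.
            for i in range(n):
                p = i ^ stride
                if p > i:
                    ascending = (i & size) == 0
                    if (data[i] > data[p]) == ascending:
                        data[i], data[p] = data[p], data[i]
            launches += 1
            stride //= 2
        size *= 2
    return launches


values = [(i * 37 + 11) % 101 for i in range(16)]
print(sort_with_shared_memory_tiles(values, tile=4))  # 7 launches instead of 10
print(values == sorted(values))                       # True
```

The sketch fuses only the trailing steps of each phase; early phases that fit
entirely into one tile could in principle be combined further.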

4.2 Optimization 2: using registers

Another optimization is to use registers. According to Figure 2, in phase 3
the four addresses [0,2,4,6] are used in two successive steps, step 3 and
step 2, so we can hold them in registers and thereby reduce two steps to one.
Analogously, this method can reduce the number of kernel launches and the
number of global memory accesses.
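
A CPU sketch of this idea follows. Each simulated thread loads four elements
once into a small local list (standing in for registers), performs the
compare-exchanges of two successive steps on them, and stores them once. The
function names and the index mapping are assumptions for illustration; for
n = 8 in phase 3, one thread handles exactly the addresses 0, 2, 4 and 6.

```python
def two_steps_in_registers(data, size, stride):
    """Perform the steps with strides `stride` and `stride // 2` of the phase
    with subsequence length `size` in one pass (requires stride >= 2)."""
    n = len(data)
    h = stride // 2
    for tid in range(n // 4):                 # one simulated thread per 4 elements
        i = 4 * h * (tid // h) + (tid % h)    # lowest index of the group
        idx = [i, i + h, i + 2 * h, i + 3 * h]
        r = [data[x] for x in idx]            # load once into "registers"
        ascending = (i & size) == 0
        # First the two pairs at distance 2h, then the two pairs at distance h.
        for a, b in ((0, 2), (1, 3), (0, 1), (2, 3)):
            if (r[a] > r[b]) == ascending:
                r[a], r[b] = r[b], r[a]
        for x, v in zip(idx, r):              # store once
            data[x] = v


def sort_with_register_merging(data):
    """Bitonic sort merging two steps per launch where possible.

    Returns the number of simulated kernel launches.
    """
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    launches = 0
    size = 2
    while size <= n:
        stride = size // 2
        while stride >= 1:
            if stride >= 2:
                two_steps_in_registers(data, size, stride)
                stride //= 4
            else:
                for i in range(0, n, 2):      # single remaining step, stride 1
                    ascending = (i & size) == 0
                    if (data[i] > data[i + 1]) == ascending:
                        data[i], data[i + 1] = data[i + 1], data[i]
                stride = 0
            launches += 1
        size *= 2
    return launches


values = [10, 2, 7, 5, 12, 1, 9, 8]
print(sort_with_register_merging(values))  # 4 launches instead of 6
print(values)                              # [1, 2, 5, 7, 8, 9, 10, 12]
```

Each element pair of the two merged steps is read from and written to the
array once instead of twice, which is where the saving in global memory
accesses comes from.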


5 RESULTS 

The test results were obtained on an experimental platform with a CPU (Intel
Xeon E5-2620) and a GPU (Kepler architecture, K10). The experimental data are
random 32-bit integers, with array sizes from 128K to 256M. The results are
shown in Table 1.

Table 1 shows that quick sort is much faster than bitonic sort on the CPU;
this is due to the difference in time complexity. After our two optimization
steps, the efficiency of bitonic sort on the GPU increases significantly.

6 CONCLUSION AND FUTURE WORK 

We have presented an efficient implementation of bitonic sort based on CUDA.
To achieve this we carefully optimized our implementation with respect to the
number of accesses to global memory and the number of kernel launches. For
arrays of random 32-bit integers, our experimental results show that the
speedup of the GPU over the CPU is more than 20 times.
Table 1: results

             CPU Times (ms)              GPU BitonicSort Times (ms)
Array size   Quicksort    BitonicSort    Basic      Semi       Optimized   Ratio
128K         -            30.00          0.76       0.46       0.36        -
256K         20.00        60.00          1.21       0.87       0.66        30.2
512K         30.00        110.00         2.22       1.78       1.31        22.7
1M           80.00        250.00         4.58       3.89       2.80        28.5
2M           150.00       550.00         8.90       7.95       5.87        25.5
4M           280.00       1230.00        18.14      16.59      12.30       22.7
8M           590.00       2670.00        38.13      35.29      26.36       22.3
16M          1230.00      5880.00        80.09      75.52      56.27       21.8
32M          2570.00      12900.00       173.77     162.56     120.93      21.3
64M          5360.00      27780.00       373.52     350.87     258.61      20.7
128M         11180.00     59860.00       803.16     756.94     553.49      20.1
256M         23260.00     128660.00      1727.23    1631.92    1185.02     19.6

Basic: no optimization.
Semi: optimization 1.
Optimized: optimization 1 and optimization 2.
Ratio: acceleration ratio = Times(CPU Quick Sort) / Times(GPU Bitonic Sort).


Because our current work is relatively simple, we will pursue two main
directions in future work. The first is to test different types of data, such
as 64-bit integers, 32-bit floats and 64-bit doubles. The second is to further
explore and compare the performance of a multicore GPU bitonic sort
implementation.